where GIoU(·) is the generalized intersection over union function [202]. Each $G_i$ reflects the “closeness” of student proposals to the $i$-th ground-truth object. Then, we retain highly qualified student proposals around at least one ground truth to benefit object recognition [235] as:

$$
\tilde{b}^S_j =
\begin{cases}
b^S_j, & \text{GIoU}(b^{GT}_i,\, b^S_j) > \tau G_i,\ \exists\, i,\\
\varnothing, & \text{otherwise},
\end{cases}
\qquad (2.34)
$$
where $\tau$ is a threshold controlling the proportion of distilled queries.
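For illustration, the selection in Eq. (2.34) can be written as a boolean mask over a pairwise GIoU matrix. The following PyTorch sketch uses `generalized_box_iou` from torchvision; the function name, tensor shapes, and the (x1, y1, x2, y2) box format are assumptions of this illustration rather than the authors' implementation.

```python
from torchvision.ops import generalized_box_iou

def select_distilled_proposals(gt_boxes, student_boxes, G, tau=0.5):
    """Keep student proposals close to at least one ground truth (Eq. 2.34).

    gt_boxes:      (M, 4) ground-truth boxes in (x1, y1, x2, y2) format
    student_boxes: (N, 4) student proposal boxes
    G:             (M,)   per-ground-truth "closeness" scores G_i
    tau:           threshold controlling the proportion of distilled queries
    """
    # Pairwise GIoU between every ground truth and every student proposal: (M, N)
    giou = generalized_box_iou(gt_boxes, student_boxes)
    # Proposal j survives if GIoU(b_i^GT, b_j^S) > tau * G_i for at least one i
    keep = (giou > tau * G[:, None]).any(dim=0)  # (N,) boolean mask
    return student_boxes[keep], keep
```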
After removing the object-empty ($\varnothing$) queries, we form a distillation-desired student query set, denoted $\tilde{q}^S$, associated with its object set $\tilde{y}^S = \{\tilde{c}^S_j, \tilde{b}^S_j\}_{j=1}^{\tilde{N}}$. Correspondingly, we can obtain a teacher query set $\tilde{y}^T = \{\tilde{c}^T_j, \tilde{b}^T_j\}_{j=1}^{\tilde{N}}$. For the $j$-th student query, its corresponding teacher query is matched as:
$$
\{\tilde{c}^T_j, \tilde{b}^T_j\} = \operatorname*{arg\,max}_{\{\tilde{c}^T_k,\, \tilde{b}^T_k\}_{k=1}^{N}} \Big[ \mu_1\, \text{GIoU}(\tilde{b}^S_j,\, b^T_k) - \mu_2 \big\| \tilde{b}^S_j - b^T_k \big\|_1 \Big],
\qquad (2.35)
$$
where $\mu_1 = 2$ and $\mu_2 = 5$ control the matching function; these values follow [31].
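Under the same assumptions, the matching of Eq. (2.35) reduces to a row-wise argmax over a pairwise score matrix. A minimal sketch, with illustrative names and shapes:

```python
import torch
from torchvision.ops import generalized_box_iou

def match_teacher_queries(student_boxes, teacher_boxes, mu1=2.0, mu2=5.0):
    """Match each retained student box to the teacher query that maximizes
    mu1 * GIoU - mu2 * L1 (Eq. 2.35); mu1 = 2 and mu2 = 5 follow [31].

    student_boxes: (N_sel, 4) retained student boxes
    teacher_boxes: (N, 4)     teacher boxes
    """
    giou = generalized_box_iou(student_boxes, teacher_boxes)  # (N_sel, N)
    l1 = torch.cdist(student_boxes, teacher_boxes, p=1)       # (N_sel, N) pairwise L1
    score = mu1 * giou - mu2 * l1
    # Index of the best-scoring teacher query for each student query
    return score.argmax(dim=1)
```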
Finally, the upper-level optimization after rectification in Eq. (2.29) becomes:
$$
\min_{\theta}\; H(\tilde{q}^{S*} \,|\, \tilde{q}^T).
\qquad (2.36)
$$
Optimizing Eq. (2.36) is challenging. Alternatively, we minimize the norm distance between $\tilde{q}^{S*}$ and $\tilde{q}^T$, whose optimum, i.e., $\tilde{q}^{S*} = \tilde{q}^T$, is exactly the same as that of Eq. (2.36). Thus, our final distribution rectification distillation loss becomes:
$$
\mathcal{L}_{\text{DRD}}(\tilde{q}^{S*}, \tilde{q}^T) = \mathbb{E}\big[ \| \tilde{D}^{S*} - \tilde{D}^T \|_2 \big],
\qquad (2.37)
$$
where we use the Euclidean distance of the co-attended feature $\tilde{D}$ (see Eq. 2.26), which contains the information of the query $\tilde{q}$, for optimization.
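A minimal sketch of Eq. (2.37), assuming the matched co-attended features arrive as row-aligned tensors and that no gradient should flow into the teacher (the `detach()` is our assumption):

```python
def drd_loss(d_student, d_teacher):
    """Distribution rectification distillation loss (Eq. 2.37).

    d_student: (N_sel, C) co-attended features of the rectified student queries
    d_teacher: (N_sel, C) matched teacher features
    """
    # E[||D_S* - D_T||_2]: per-query Euclidean distance, averaged over queries
    return (d_student - d_teacher.detach()).norm(p=2, dim=-1).mean()
```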
In backward propagation, the gradient update drives the student queries toward their teacher hints, thereby accomplishing the distillation. The overall training loss for our Q-DETR model is:
$$
\mathcal{L} = \mathcal{L}_{GT}(y^{GT}, y^S) + \lambda\, \mathcal{L}_{\text{DRD}}(\tilde{q}^{S*}, \tilde{q}^T),
\qquad (2.38)
$$
where $\mathcal{L}_{GT}$ is the common detection loss for tasks such as proposal classification and coordinate regression [31], and $\lambda$ is a trade-off hyper-parameter.
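Putting the two terms of Eq. (2.38) together, a hypothetical composition might look as follows; `l_gt` stands in for the standard detection criterion [31], and the default value of `lam` is an assumption for illustration, not a setting reported here:

```python
def q_detr_loss(l_gt, d_student, d_teacher, lam=1.0):
    """Overall objective of Eq. (2.38): L = L_GT + lambda * L_DRD.

    l_gt:      scalar detection loss (proposal classification + box regression)
    d_student: (N_sel, C) rectified student co-attended features
    d_teacher: (N_sel, C) matched teacher features
    lam:       trade-off hyper-parameter (illustrative default)
    """
    l_drd = (d_student - d_teacher.detach()).norm(p=2, dim=-1).mean()  # Eq. (2.37)
    return l_gt + lam * l_drd
```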
2.4.5 Ablation Study
Datasets. We first conduct ablation studies and hyper-parameter selection on the PASCAL VOC dataset [62], which contains natural images from 20 different classes. We use the VOC trainval2012 and VOC trainval2007 sets, approximately 16k images in total, to train our model, and the VOC test2007 set, which contains 4952 images, to evaluate our Q-DETR. We report COCO-style metrics for the VOC dataset: AP, AP50 (the default VOC metric), and AP75. We further conduct experiments on the COCO 2017 [145] object detection task. Specifically, we train the models on COCO train2017 and evaluate them on COCO val2017. We report the average precision for IoUs ∈ [0.5 : 0.05 : 0.95], designated as AP, using COCO's standard evaluation metric. To further analyze our method, we also report AP50, AP75, APs, APm, and APl.
Implementation Details. Our Q-DETR is trained within the DETR [31] and SMCA-DETR [70] frameworks. We select ResNet-50 [84] as the backbone and modify it with pre-activation structures and the RPReLU [158] function, following [155]. PyTorch [185] is used to implement Q-DETR. We run the experiments on 8 NVIDIA Tesla A100 GPUs with 80 GB